Revisit handling of images processing and other fixes #2143

benoit74 · 2025-01-28T10:26:25Z

This is a fairly significant PR that fixes several issues around image processing.

Fix #2140
Fix #2136
Fix #2088
Fix #2138

Changes

- When processing HTML, distinguish between images, videos and other media (PDF, ...)
  - store this information in redis for later retrieval
  - no longer guess whether a given media is an image based on its URL
  - no longer guess the content-type from the image URL or from the response header; compute it with file-types from the real image data
  - remove the corresponding functions isImageUrl and getMimeType, and the constant IMAGE_URL_REGEX
  - remove the now useless mime-type package
  - for now, videos do not get any special treatment (e.g. reencoding), but everything is ready for that
  - other media are always simply downloaded
- Do not add a .webp suffix to the path of images which have been converted to webp
  - as mentioned in "Logic to set .webp path prefix on reencoded images is skewed" (#2140), and as observed in "Do not rely on URL filename extension to detect images" (#2088), we cannot have this information at HTML processing time
  - not having the proper extension in the ZIM path has no consequence
  - this also allows converting images referenced in CSS stylesheets to webp without having to worry about this
- Stop pushing content-type to S3 metadata
  - we do not need this information anymore
  - there is too much risk that this information is wrong due to a bug
  - entries already in S3 with this metadata can stay as they are; this has essentially no consequences
- Define a clear API for the information returned by downloader.downloadContent when downloading content, instead of passing the whole response upstream (which could contain "anything")
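
A minimal sketch of two ideas from the list above: deriving the content type from the bytes themselves, and returning a narrow result instead of the whole HTTP response. The description says detection is done with file-types; the hand-rolled `sniffContentType` below is only a simplified stand-in that knows four signatures, and `DownloadResult` and `toDownloadResult` are hypothetical names, not the actual mwoffliner API.

```typescript
// Hypothetical shape of what a download returns: only the fields callers need.
interface DownloadResult {
  content: Uint8Array
  contentType: string | null // derived from the bytes, never from the URL
}

// Does `data` contain `signature` starting at `offset`?
function hasBytes(data: Uint8Array, signature: number[], offset = 0): boolean {
  if (data.length < offset + signature.length) return false
  return signature.every((byte, i) => data[offset + i] === byte)
}

// Simplified stand-in for a real detection library: checks a few magic numbers.
function sniffContentType(data: Uint8Array): string | null {
  if (hasBytes(data, [0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a])) return 'image/png'
  if (hasBytes(data, [0xff, 0xd8, 0xff])) return 'image/jpeg'
  if (hasBytes(data, [0x47, 0x49, 0x46, 0x38])) return 'image/gif' // "GIF8"
  // WebP is a RIFF container: "RIFF", a 4-byte size, then "WEBP"
  if (hasBytes(data, [0x52, 0x49, 0x46, 0x46]) && hasBytes(data, [0x57, 0x45, 0x42, 0x50], 8)) return 'image/webp'
  return null
}

// Wrap a downloaded body into the narrow result type.
function toDownloadResult(body: Uint8Array): DownloadResult {
  return { content: body, contentType: sniffContentType(body) }
}

// Even if the URL ended in ".png", these bytes are WebP.
const webpBytes = new Uint8Array([0x52, 0x49, 0x46, 0x46, 0x24, 0, 0, 0, 0x57, 0x45, 0x42, 0x50])
const result = toDownloadResult(webpBytes)
```

Here `result.contentType` is `'image/webp'` regardless of what the URL or the response headers claimed, which is the property the change relies on.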

kelson42 · 2025-01-28T10:40:48Z

@benoit74 Just to be clear, I am glad to see you working on the issue, but I don't think putting webp content at a path ending with .png (just an example) is a good idea at all. It is simply semantically wrong and we should not do that IMHO. The current approach works (modulo bugs, as always), and if we really want to do better we should keep track of the content mime-type (instead of relying on the extension).

benoit74 · 2025-01-28T10:51:11Z

> Just to be clear, I am glad to see you working on the issue, but I don't think putting webp content at a path ending with .png (just an example) is a good idea at all. It is simply semantically wrong and we should not do that IMHO.

I agree, but this would mean a significant redesign of the scraper. With the current architecture, as stated in the issue, we cannot know at HTML rewriting time what the result of the image download/conversion will be: for this we need to download the image and try the reencoding, which is currently done at a totally different stage.

For now I prefer to have a scraper that produces working ZIMs under all conditions, with some semantic incoherence invisible to 99% of our users, rather than non-working ZIMs like #2088. I do not mind opening an issue to fix this semantic incoherence in the medium / long term. For the record, this semantic incoherence has been present since "forever" in the S3 keys used to cache images, and we have lived pretty well with it.
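
The constraint described in this comment can be sketched as two stages that only communicate through a store. This is an illustration under assumptions, not the scraper's code: a `Map` stands in for redis, the `assets/` path scheme is invented, and `MediaKind`, `registerMedia`, `processMedia` and the injected `convertToWebp` are hypothetical names.

```typescript
type MediaKind = 'image' | 'video' | 'other'

interface MediaEntry {
  url: string
  kind: MediaKind // decided from the HTML element, not from the URL
  zimPath: string // fixed while rewriting HTML, before any download
}

// Stand-in for the redis store shared by the two stages.
const mediaStore = new Map<string, MediaEntry>()

// Stage 1: HTML rewriting. Only the tag name and the URL are known here.
function registerMedia(tagName: string, url: string): string {
  const tag = tagName.toLowerCase()
  const kind: MediaKind = tag === 'img' ? 'image' : tag === 'video' ? 'video' : 'other'
  const zimPath = 'assets/' + encodeURIComponent(url) // hypothetical path scheme
  mediaStore.set(zimPath, { url, kind, zimPath })
  return zimPath // written into the rewritten HTML immediately
}

// Stage 2: download and optional re-encoding, run much later.
// `convertToWebp` returns null when conversion fails or is not worthwhile.
function processMedia(
  zimPath: string,
  downloaded: Uint8Array,
  convertToWebp: (data: Uint8Array) => Uint8Array | null,
): { zimPath: string; content: Uint8Array; converted: boolean } {
  const entry = mediaStore.get(zimPath)
  if (!entry) throw new Error('unknown media: ' + zimPath)
  const webp = entry.kind === 'image' ? convertToWebp(downloaded) : null
  // Whatever happens here, the path already embedded in the HTML cannot change.
  return { zimPath, content: webp ?? downloaded, converted: webp !== null }
}

// Usage: the path is chosen before we know whether conversion will succeed.
const imagePath = registerMedia('img', 'https://example.org/logo.png')
const processed = processMedia(imagePath, new Uint8Array([1, 2, 3]), () => new Uint8Array([9]))
```

In this sketch `processed.converted` is true while `processed.zimPath` still ends in `logo.png`, which is the semantic mismatch being discussed: avoiding it would require stage 1 to wait for the outcome of stage 2.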

benoit74 · 2025-01-28T10:59:08Z

Sample ZIMs:

wikipedia_hi_basketball_maxi_2025-01.zim : second run on dev S3 bucket (i.e. all images come from the S3 cache)
psychonautwiki_en_all_maxi_2025-01.zim : first run on dev S3 bucket (i.e. all images are downloaded from the online source)

kelson42 · 2025-01-28T12:23:04Z

> For now I prefer to have a scraper that produces working ZIMs under all conditions, with some semantic incoherence invisible to 99% of our users, rather than non-working ZIMs like #2088. I do not mind opening an issue to fix this semantic incoherence in the medium / long term. For the record, this semantic incoherence has been present since "forever" in the S3 keys used to cache images, and we have lived pretty well with it.

There is not, and should not be, any incoherence in S3, because the entry is tagged "webp" AFAIK.

benoit74 · 2025-01-28T12:29:31Z

> There is not, and should not be, any incoherence in S3, because the entry is tagged "webp" AFAIK.

The S3 key is computed directly from the online URL, without any logic handling webp conversion:

mwoffliner/src/Downloader.ts, line 591 in fc2af69:

    await this.s3.uploadBlob(stripHttpFromUrl(url), mwResp.data, etag, mwResp.headers['content-type'], this.webp ? 'webp' : '1')
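
To illustrate the point, here is a small sketch of a key that depends only on the URL. `stripScheme` and `cacheKey` are hypothetical helpers written for this example; `stripScheme` is not claimed to match the real `stripHttpFromUrl`. In the call quoted above, the webp marker is passed as a separate argument and does not take part in the key.

```typescript
// Hypothetical stand-in: drop a leading "http://" or "https://" from a URL.
function stripScheme(url: string): string {
  return url.replace(/^https?:\/\//, '')
}

// The key depends on the URL only, so it is identical with or without WebP conversion.
function cacheKey(url: string, _convertedToWebp: boolean): string {
  return stripScheme(url)
}

const keyWithWebp = cacheKey('https://upload.example.org/a/b/Logo.png', true)
const keyWithoutWebp = cacheKey('https://upload.example.org/a/b/Logo.png', false)
// Both are 'upload.example.org/a/b/Logo.png': the key ends in ".png" even when the stored bytes are WebP.
```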

codecov · 2025-01-30T11:04:16Z

Codecov Report

Attention: Patch coverage is 75.55556% with 33 lines in your changes missing coverage. Please review.

Project coverage is 75.19%. Comparing base (75cf0fe) to head (88bbeec).

Files with missing lines	Patch %	Lines
src/Downloader.ts	53.06%	19 Missing and 4 partials ⚠️
src/mwoffliner.lib.ts	60.00%	2 Missing ⚠️
src/renderers/abstract.renderer.ts	94.73%	2 Missing ⚠️
src/util/misc.ts	0.00%	2 Missing ⚠️
src/MediaWiki.ts	66.66%	1 Missing ⚠️
src/S3.ts	0.00%	1 Missing ⚠️
src/util/articleListMainPage.ts	0.00%	1 Missing ⚠️
src/util/saveArticles.ts	95.83%	1 Missing ⚠️

❌ Your patch check has failed because the patch coverage (75.55%) is below the target coverage (90.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2143      +/-   ##
==========================================
- Coverage   75.98%   75.19%   -0.79%     
==========================================
  Files          41       41              
  Lines        3202     3213      +11     
  Branches      706      704       -2     
==========================================
- Hits         2433     2416      -17     
- Misses        655      676      +21     
- Partials      114      121       +7


benoit74 self-assigned this Jan 28, 2025

benoit74 force-pushed the images_processing branch from 8ab34d2 to 8cf2027 Compare January 28, 2025 10:36

benoit74 force-pushed the images_processing branch from 8cf2027 to 35b686a Compare January 28, 2025 12:26

benoit74 force-pushed the images_processing branch from 35b686a to c413004 Compare January 30, 2025 10:10

Revisit handling of images processing and other fixes

88bbeec

benoit74 force-pushed the images_processing branch from c413004 to 88bbeec Compare January 30, 2025 10:35

benoit74 marked this pull request as ready for review January 30, 2025 11:31

benoit74 requested a review from kelson42 January 30, 2025 11:32

benoit74 mentioned this pull request Jan 30, 2025

Pre-install all Node.JS dependencies to make image smaller/faster #2148

Open
